1 R Markdown

This is a milestone report for the capstone project of the JHU Capstone Course by Rongbin Ye, part of the Data Science Specialization on Coursera. This report provides summary statistics about the data sets and reports some interesting findings via term frequency, n-grams, and word clouds. Furthermore, based on the outcome of the exploratory data analysis, this report proposes a plan for the capstone project, including a prediction algorithm for typing recommendation and a Shiny app on AWS.

2 Milestone Plan

As discussed in the beginning, this report provides a milestone plan for the whole capstone project. The scheduled time is seven weeks. Given this time constraint, the milestones have been set in an accelerated manner, which entails a tight schedule for data cleaning and model development. There are two major deliverables: 1. An algorithm that recommends correlated English words, based on the given data set. 2. A Shiny app for users to interact with.

The milestones have been set in the following four sections:

Week 1 & 2: Data Familiarization + Exploratory Data Cleaning

Week 3 & 4: Text Mining: extract and identify the key patterns related to the usage habits of English writers

Week 5 & 6: Model Development: an algorithm to be developed and tuned in this process

Week 7: Develop and Deploy a Shiny App

3 Exploratory Analysis

3.1 Preparation

3.1.1 Load Libraries

library(tidyverse)
library(tm)
library(lexicon)
library(stringr)
library(stopwords)
library(tidytext)
library(textstem)
library(tidyr)

3.2 Load Data

In this section, using file connections, the txt files have been read into R line by line. The texts are stored as character vectors. Based on these three character chunks, the author is able to conduct preliminary data cleaning and text preprocessing for exploratory data analysis.

# read in multiple lines into one data frame: blogs
blogs_con <- file("~/Downloads/final/en_US/en_US.blogs.txt")
blogs <- readLines(con = blogs_con)
close(blogs_con)
# all the blogs have been loaded in properly
# read in multiple lines into one data frame: twitters
twitters_con <- file("~/Downloads/final/en_US/en_US.twitter.txt")
twitter <- readLines(con = twitters_con)
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul
close(twitters_con)
# all the tweets have been read in properly; the embedded-nul warnings above
# could be suppressed with readLines(con, skipNul = TRUE)

# read in multiple lines into one data frame: news
news_con <- file("~/Downloads/final/en_US/en_US.news.txt")
news <- readLines(con = news_con)
close(news_con)
# all the news lines have been read in properly

4 Basic Information

After loading in the data, let us look at the basic shape of the three data sets.

len_news <- length(news)
len_blogs <- length(blogs)
len_twitters <- length(twitter)

The news data set contains 1010242 lines, the blogs data set 899288 lines, and the twitter data set 2360148 lines. Let us have a closer look at the words.

5 Text Preprocessing

In order to conduct the analysis, the three texts will be cleaned. The major cleaning steps include lowercasing, stripping whitespace on both sides, removing numbers and punctuation, lemmatization, and tokenization. Eventually, the text chunks are expected to be broken down into tokens for further analysis and comparison at the level of sentences and words.

blogs <- blogs %>%  removeNumbers() %>% removePunctuation() %>% lemmatize_strings()

news <- news %>%  removeNumbers() %>% removePunctuation() %>% lemmatize_strings()

twitter <- twitter %>% removeNumbers() %>% removePunctuation() %>% lemmatize_strings()
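The pipeline above relies on the tm and textstem packages. As a rough base-R sketch of the same cleaning steps (lowercasing, number and punctuation removal, whitespace stripping, tokenization), omitting lemmatization, which has no base-R equivalent, one could write:

```r
# Hypothetical base-R equivalent of the cleaning pipeline above.
clean_text <- function(x) {
  x <- tolower(x)                    # lowercase
  x <- gsub("[0-9]+", "", x)         # remove numbers
  x <- gsub("[[:punct:]]+", "", x)   # remove punctuation
  x <- gsub("\\s+", " ", trimws(x))  # strip and collapse whitespace
  x
}

# Break a cleaned line into word tokens.
tokenize <- function(x) strsplit(clean_text(x), " ")[[1]]

tokenize("It's 2024, and I LOVE writing blogs!")
```

This is only an illustration of what the tm verbs do under the hood; the report itself keeps the package-based pipeline.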

6 Frequency Analysis

6.0.1 Blogs

blogs_df <- tibble(line = 1:length(blogs), text = blogs)

tokens_blogs <- blogs_df  %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
tokens_blogs
## # A tibble: 349,020 x 2
##    word        n
##    <chr>   <int>
##  1 time   105751
##  2 day     70910
##  3 people  61543
##  4 love    58575
##  5 life    48685
##  6 feel    48000
##  7 book    41394
##  8 start   38799
##  9 im      37863
## 10 week    37765
## # … with 349,010 more rows

Surprisingly, the five words bloggers wrote about most are: time, day, people, love, and life. Such a philosophical, down-to-earth finding points to the likely topics. One preliminary conclusion is that these blogs cover everyday life rather than any specialized subject.

6.0.2 News

news_df <- tibble(line = 1:length(news), text = news)

tokens_news <- news_df  %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
tokens_news
## # A tibble: 284,001 x 2
##    word        n
##    <chr>   <int>
##  1 time    65807
##  2 people  48738
##  3 game    48072
##  4 school  46734
##  5 day     43829
##  6 play    43596
##  7 city    41282
##  8 include 39438
##  9 team    38495
## 10 call    34729
## # … with 283,991 more rows

Meanwhile, it seems the topics in the news are more about time, people, game, school, and day. Indeed, one preliminary thought is that much of the coverage might be sports news. Despite not naming any specific sport, the news talks about the game, play, and team, which supports this preliminary finding.

tokens_news$org <-"news"   
tf_idf_news <- tokens_news %>% tidytext::bind_tf_idf(word,org, n)
tf_idf_news %>% arrange(desc(tf_idf))
## # A tibble: 284,001 x 6
##    word        n org        tf   idf tf_idf
##    <chr>   <int> <chr>   <dbl> <dbl>  <dbl>
##  1 time    65807 news  0.00431     0      0
##  2 people  48738 news  0.00319     0      0
##  3 game    48072 news  0.00314     0      0
##  4 school  46734 news  0.00306     0      0
##  5 day     43829 news  0.00287     0      0
##  6 play    43596 news  0.00285     0      0
##  7 city    41282 news  0.00270     0      0
##  8 include 39438 news  0.00258     0      0
##  9 team    38495 news  0.00252     0      0
## 10 call    34729 news  0.00227     0      0
## # … with 283,991 more rows

To further weigh the frequencies of words, as one can see, the adjusted term frequency (tf-idf) does not help us identify keywords here: since all tokens were assigned to a single document, the inverse document frequency is zero for every word, and so is tf-idf. Hence, using the term frequency alone is enough to understand this corpus.
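The zero idf column follows directly from the definition idf = ln(n documents / n documents containing the term). A minimal base-R illustration, using hypothetical document counts rather than the corpus itself, shows why idf only becomes informative once several documents are compared:

```r
# idf = ln(number of documents / number of documents containing the term).
idf <- function(n_docs, n_docs_with_term) log(n_docs / n_docs_with_term)

# Single-document corpus: every term appears in that one document,
# so idf is always ln(1/1) = 0, exactly as in the tf_idf_news table.
idf(1, 1)

# With two documents, a word shared by both still gets 0,
# while a word unique to one document gets ln(2).
idf(2, 2)
idf(2, 1)
```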

6.1 Word Cloud - news

# To summarize the existing information, let us develop a word cloud accordingly. 
wordcloud::wordcloud(words = tf_idf_news$word, freq = tf_idf_news$tf, max.words = 10, colors = RColorBrewer::brewer.pal(8, "Dark2"))

6.0.3 Twitter

# twitter has already been cleaned above (numbers and punctuation removed,
# lemmatized), so the preprocessing step is not repeated here

twitters_df <- tibble(line = 1:length(twitter), text = twitter)

tokens_twitters <- twitters_df  %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
tokens_twitters
## # A tibble: 455,532 x 2
##    word         n
##    <chr>    <int>
##  1 im      157965
##  2 love    120102
##  3 day     108818
##  4 rt       88743
##  5 time     84990
##  6 lol      66791
##  7 follow   66295
##  8 people   52817
##  9 happy    49539
## 10 tonight  43912
## # … with 455,522 more rows
tokens_twitters$org <- "twitters"
tokens_blogs$org <-"blogs"
tf_idf_twitter <- tokens_twitters %>% tidytext::bind_tf_idf(word,org, n)
tf_idf_blogs <- tokens_blogs %>% tidytext::bind_tf_idf(word,org, n)

Yet one of the issues with pure term frequency is that ubiquitous words dominate the ranking. In order to adjust for the influence of highly frequent words, I adopt the inverse document frequency and expand the raw counts n into a tf-idf model. Using the bind_tf_idf function, the tf-idf metrics are provided.

6.2 Summary of All Three Files

tokens_all <- rbind(tokens_news, tokens_twitters)
tokens_all <- rbind(tokens_all, tokens_blogs)
tf_idf_all <- tokens_all %>% tidytext::bind_tf_idf(word,org, n)
news_20 <- top_n(tf_idf_news, 20, wt = tf) %>% select(word)
twitter_20 <- top_n(tf_idf_twitter, 20, wt = tf) %>% select(word)
blogs_20 <- top_n(tf_idf_blogs, 20, wt = tf) %>% select(word)
all_20 <- cbind(news_20, twitter_20)
all_20 <- cbind(all_20, blogs_20)
colnames(all_20) <- c("Top News", "Top Tweets", "Top Blogs")
all_20
##    Top News Top Tweets Top Blogs
## 1      time        im       time
## 2    people      love        day
## 3      game       day     people
## 4    school        rt       love
## 5       day      time       life
## 6      play       lol       feel
## 7      city    follow       book
## 8   include    people      start
## 9      team     happy         im
## 10     call   tonight       week
## 11  percent     night      write
## 12     home      feel      leave
## 13      run     watch       read
## 14  million      hope      world
## 15   county     youre       call
## 16    start      game        don
## 17     week      life       home
## 18   season     tweet     friend
## 19      win     start        lot
## 20  company      week       post

7 N-Gram Analysis

After exploring the frequencies of single words, let us look at the connections among words. In this section, the exploration will focus on two kinds of connections: pairs of words (bigrams) and triples of words (trigrams).

Bigram Analysis

Blogs

bigram_blogs <- blogs_df %>% unnest_tokens(bigram, text, token = "ngrams", n=2)
bigram_blogs_final <- bigram_blogs %>% 
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)
T10_blogs <- top_n(bigram_blogs_final, 10, wt = n)
T10_blogs
## # A tibble: 10 x 3
##    word1  word2      n
##    <chr>  <chr>  <int>
##  1 spin   dry     3767
##  2 week   ago     1860
##  3 ice    cream   1472
##  4 blog   post    1402
##  5 social medium  1332
##  6 jesus  christ  1318
##  7 month  ago     1234
##  8 south  africa  1206
##  9 spend  time    1075
## 10 olive  oil     1041

7.0.1 News

bigram_news <- news_df %>% unnest_tokens(bigram, text, token = "ngrams", n=2)
bigram_news_final <- bigram_news %>% 
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)
T10_news <- top_n(bigram_news_final, 10, wt = n)
T10_news
## # A tibble: 10 x 3
##    word1     word2         n
##    <chr>     <chr>     <int>
##  1 st        louis      8947
##  2 los       angeles    5189
##  3 san       francisco  4356
##  4 health    care       3586
##  5 school    district   3073
##  6 vice      president  2884
##  7 san       diego      2638
##  8 police    officer    2292
##  9 white     house      2276
## 10 executive director   2159

The bigrams uncover the geographical coverage of the news in the data. St. Louis, Los Angeles, San Francisco, San Diego, and probably D.C. (White House) are the major regions these news reports cover.

7.0.2 Twitter

bigram_twitters <- twitters_df %>% unnest_tokens(bigram, text, token = "ngrams", n=2)
bigram_twitters_final <- bigram_twitters %>% 
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)

T10_twitters <- top_n(bigram_twitters_final, 10, wt = n)
T10_twitters
## # A tibble: 10 x 3
##    word1  word2        n
##    <chr>  <chr>    <int>
##  1 happy  birthday  8355
##  2 mother day       5429
##  3 im     gonna     4145
##  4 social medium    3773
##  5 happy  mother    3371
##  6 stay   tune      2532
##  7 san    diego     2198
##  8 im     glad      2057
##  9 rt     rt        2051
## 10 happy  friday    1941
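The code above only extracts bigrams, although trigrams were announced as well. On the real corpora, trigrams would come from unnest_tokens(trigram, text, token = "ngrams", n = 3), in exact analogy to the bigram chunks. As a self-contained base-R sketch of trigram counting on a toy sentence:

```r
# Base-R trigram counting on a toy sentence (hypothetical input, not the corpora):
# slide a window of three consecutive tokens and tabulate the triples.
tokens <- strsplit(tolower("the quick fox and the quick fox ran"), " ")[[1]]
n <- length(tokens)
trigrams <- paste(tokens[1:(n - 2)], tokens[2:(n - 1)], tokens[3:n])
sort(table(trigrams), decreasing = TRUE)
```

Here "the quick fox" appears twice and every other triple once, which is the kind of repeated pattern a trigram model would exploit.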

8 Conclusion

Through the exploration, one could discover some key patterns in these corpora. The summary of the major keywords is presented here in the form of word clouds.

Word Cloud - All

All_join_summary <- inner_join(tf_idf_news, tf_idf_blogs, by = "word")
All_join_summary <- inner_join(All_join_summary, tf_idf_twitter, by = "word")
# sum the metrics across all three sources (.x = news, .y = blogs, unsuffixed = twitter)
All_join_summary$tf_idf_all <- All_join_summary$tf_idf.x + All_join_summary$tf_idf.y + All_join_summary$tf_idf
All_join_summary$tf_all <- All_join_summary$tf.x + All_join_summary$tf.y + All_join_summary$tf
All_join_summary <- All_join_summary %>% arrange(desc(tf_all))
All_join_summary
## # A tibble: 72,225 x 18
##    word    n.x org.x    tf.x idf.x tf_idf.x    n.y org.y    tf.y idf.y tf_idf.y
##    <chr> <int> <chr>   <dbl> <dbl>    <dbl>  <int> <chr>   <dbl> <dbl>    <dbl>
##  1 time  65807 news  4.31e-3     0        0 105751 blogs 0.00764     0        0
##  2 day   43829 news  2.87e-3     0        0  70910 blogs 0.00512     0        0
##  3 im    17412 news  1.14e-3     0        0  37863 blogs 0.00273     0        0
##  4 love  13609 news  8.90e-4     0        0  58575 blogs 0.00423     0        0
##  5 peop… 48738 news  3.19e-3     0        0  61543 blogs 0.00444     0        0
##  6 feel  19514 news  1.28e-3     0        0  48000 blogs 0.00347     0        0
##  7 life  20825 news  1.36e-3     0        0  48685 blogs 0.00352     0        0
##  8 start 31863 news  2.08e-3     0        0  38799 blogs 0.00280     0        0
##  9 week  31465 news  2.06e-3     0        0  37765 blogs 0.00273     0        0
## 10 foll… 13113 news  8.58e-4     0        0  17120 blogs 0.00124     0        0
## # … with 72,215 more rows, and 7 more variables: n <int>, org <chr>, tf <dbl>,
## #   idf <dbl>, tf_idf <dbl>, tf_idf_all <dbl>, tf_all <dbl>
# To summarize the combined information, let us develop a word cloud accordingly. 
wordcloud::wordcloud(words = All_join_summary$word, freq = All_join_summary$tf_all, max.words = 10, colors = RColorBrewer::brewer.pal(8, "Dark2"))

Word Cloud - Blogs

# To summarize the existing information, let us develop a word cloud accordingly. 
wordcloud2::wordcloud2(data = tf_idf_blogs)

Word Cloud - Twitter

# To summarize the existing information, let us develop a word cloud accordingly. 
wordcloud::wordcloud(words = tf_idf_twitter$word, freq = tf_idf_twitter$tf, max.words = 10, colors = RColorBrewer::brewer.pal(8, "Dark2"))

8.1 N-Gram Analysis

After our analysis of bigrams, many meaningful connections among words were discovered. One could use these as the foundation for developing a word-recommendation system for users, should there be demand for a recommendation system based on the existing corpora.

8.2 Application of Findings

The exploration of the distribution of words and the connections between words enables further analysis at the level of both sentences and words. The bigram and trigram patterns shed light on the design of the recommendation system.
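As a sketch of how the bigram counts could seed that recommendation algorithm, the following base-R function looks up the most frequent followers of a typed word. The toy counts are a hand-picked subset of the bigram frequencies reported above, standing in for the full bigram_twitters_final / bigram_blogs_final tables:

```r
# Hypothetical bigram table (word1, word2, n) drawn from the counts reported above.
bigram_counts <- data.frame(
  word1 = c("happy", "happy", "happy", "social", "ice"),
  word2 = c("birthday", "mother", "friday", "medium", "cream"),
  n     = c(8355, 3371, 1941, 3773, 1472)
)

# Recommend the top_k most frequent words observed after `word`.
predict_next <- function(word, counts, top_k = 1) {
  followers <- counts[counts$word1 == word, ]
  followers <- followers[order(-followers$n), ]
  head(followers$word2, top_k)
}

predict_next("happy", bigram_counts)            # single best completion
predict_next("happy", bigram_counts, top_k = 3) # ranked suggestions
```

A production version would back off to unigram frequencies when the typed word has no observed followers; this sketch only covers the bigram lookup itself.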